Introduction

Since the early work with Perceptrons, the study of learning systems has had an important
place in artificial intelligence. The term 'learning' of course covers a range of behavior. At
one extreme there is the adjustment of numerical parameters that characterises, for instance,
most of spatial pattern recognition [Nilsson 1965]. At the other we now have programs such
as AM [Lenat 1977] that acquire not only new structures but also the conceptual framework
on which to hang them. Somewhere between lies the area known as induction. It has two
general characteristics: its aim is to discover structured information about some collection
of entities, and the methodology employed is the analysis of many examples of the genre,
called instances. Thus learning programs such as [Quinlan 1976] are not induction systems
because the information gleaned from instances has little structure. Neither are those such
as TEIRESIAS [Davis 1977] because, although they find or modify structures, they do so by
metalevel interaction with an expert rather than by the examination of instances.

As with most areas in AI, induction systems can be divided into those that use only general
methods and so are applicable to any problem, and those designed for specific tasks. Although
it is not intended to attempt a survey here, a short discussion of a couple of existing systems
should highlight important concepts. Much more substantial pointers into the literature may be
found in [Hayes-Roth and McDermott 1977], [Buchanan et al. 1977] and [Michalski 1978].


The work reported here was performed while the author was visiting the Computer Science
Department of Stanford University, and using facilities made available there by the Artificial
Intelligence Laboratory and the Heuristic Programming Project.

The earliest programs, dating from the late fifties, were the series of Concept Learning
Systems (CLSs), an extensive exploration of which may be found in [Hunt et al. 1966]. They
were general-purpose programs, although they found particular application in finding rules for
pattern classification. Each instance presented to them is described as a vector of discrete
values for a fixed number of attributes or properties, and all instances must have this
identical format. The output is a decision tree relating one of the attributes (normally a
class) to the values of the others. This tree is constructed top-down, starting with all
instances in a single set and seeking successively finer partitions of this set until each
block of the partition contains only instances of a single class. More details of this
approach will be given later.

A more recent system in the general-purpose category is INDUCE [Michalski 1978]. Here
each instance is represented as a series of assignments of properties; for example the
fragment

part := P1, color(P1) := blue, part := P2, color(P2) := yellow

would indicate that the object being described has, among other things, a blue part and a
yellow part. Instances need not all have the same number of components or properties, so
there is quite a lot more denotational freedom than provided by the rigid vector of CLS. The
output from INDUCE is a collection of decision rules couched in a quantified multi-valued logic,
and may employ descriptors that appear nowhere in the specification of the instances, such
as

If there are 3 or more red parts then it's a fire engine

Unlike CLS, INDUCE works bottom-up: starting from a collection of positive instances of a
class and one of negative (counter-)examples, it attempts to build successively more complex
descriptors that cover the positive but not the negative collections. Although this system does
indeed seem promising, only partial implementations of it have been developed.

There are a few systems of intermediate generality, limited either by the sort of
instances that can be analysed or the relations discovered. (For instance Thoth-p [Vere 1978]
is designed to study 'before and after' or sequences of snapshots to find the actions
responsible for the changes. The actions are represented by relational productions containing
simple sets of first-order formulae to represent context, relations destroyed and relations
created.) However most other induction programs are strongly related to some task area. The
most widely-known of these must surely be Meta-DENDRAL [Buchanan et al. 1978], which discovers
rules for identifying organic chemicals from readings of an experimental device (originally a
mass spectrometer). The instances it analyses are triples consisting of the graph of a chemical
compound, a fragment mass and its relative abundance. The rules sought are predictions of
how chemical bonds will break and atoms migrate when irradiated with high-energy particles.
In forming trial rules this system is guided by a weak model of what can and cannot happen in
the instrument; the final rules represent a much stronger model. This work is of interest on at
least three counts. The use of a weak model as a sort of plausible move generator is patently
a good idea. Sets of rules that it produced have been published as chemistry, attesting to
their power. Finally, it seems to be one of the few systems that has demonstrated an ability
to cope with large volumes of data, which is the particular concern of this paper.

Figure 1: A counterexample

During his visit to Stanford in the fall of 1978, Donald Michie discussed the following
chess problem. We consider an endgame situation in which the pieces have been reduced
to a black knight, a white rook and the two kings, and where it is black's turn to move. We
wish to know whether black is safe for at least one more white move. (The complete endgame
situation is discussed fully in [Michie et al. 1978].) Safety here has to do with keeping the
game alive a little longer. If black can capture the rook or achieve stalemate then the game
is drawn, the best possible outcome from black's point of view. A more exact specification is

• If the given situation is mate then black is not safe; if it is stalemate black is safe.

• Otherwise, black is not safe 2-ply if he cannot capture the rook and, no matter what
move he makes, white can either mate him or capture the knight without stalemate
or leaving the rook en prise.

Naturally this question can be resolved easily by search. The task, however, was to answer it in
terms only of some static description of the initial situation. Simple tabulation of all cases was
ruled out because, even after symmetry has been taken into account, there are nearly two
million possible configurations; the description of a situation had thus to characterise rather
than define it. At least four people attempted the problem by hand, but all 'solutions' turned
out to contain errors resulting from plausible but unjustified assumptions. For instance it seems
reasonable to assert that black is safe at least in those cases where neither the black king
nor the black knight is threatened, but Figure 1 shows a counterexample. It is not surprising
then that attempts to discover a rule using an induction system met with considerable difficulty.


In the course of studying this and related problems, certain techniques were developed
that seemed to have more general applicability. These are introduced in the next section. A
particular induction task and algorithm are specified and two succeeding sections discuss in
some detail the experiments carried out. These and additional results are assessed in the
conclusion.

General Description of the Approach

The essence of the induction task is discovery. The data for the task is a collection of
instances, or descriptions of some set of entities in terms of their properties. A rule is some
method of explaining an instance by establishing some relationship between these properties.
The induction task is to discover a rule adequate to explain each instance in the data. For
example, one common type of induction task is the discovery of a rule for classifying patterns.
The data is usually referred to as a training set, and one property of each instance is its class
membership. The induction task in this case is to discover a rule that relates the class of each
item in the given training set to its other properties, and hopefully also predicts the class of
many items in the entire description space from which the training set was derived.

Several systems capable of discovering such rules to explain a set D of instances were
cited in the introduction. Their workings have little in common, but it seems safe to hypothesize
that any system for this task must examine many of the instances in D many times, and that
the order in which instances are referenced is not fixed. This is fine so long as the size of D
does not preclude its storage in fast random-access memory. Many training sets of interest,
however, are too large to store in this way, and the performance of any induction system could
be expected to wilt if each reference to an instance required a disk operation or some such.

One way around this obstacle would be to select a manageable subset of D, form a rule
to explain this subset, and hope that the rule is also adequate to explain D. As mentioned
above, this is a common practice in pattern recognition, but commonsense (backed by a few
experiments) indicates that, as the rule necessary to explain D becomes more complex and
the training set becomes a smaller subset of D, the probability of a rule derived to explain
the subset also being adequate for all of D approaches zero. As an illustration, consider the
problem of discovering the rules of a two-person game from examining a random sample of
moves in context. For a game such as tic-tac-toe quite a small collection of moves would
probably suffice; in the case of chess the sample would have to be enormous before there
was near certainty that each possible move pattern was represented in it, as a consequence
both of the number of such patterns and the complexity of the context affecting their legality.

Another method is to break D into chunks, each small enough to fit in primary memory,
and to process the chunks in a fixed sequence. This is also effective, but it requires a more
complex algorithm and may lose advantages that could have arisen from the juxtaposition of
certain instances which may now be spread over several chunks. As an analogy, if A appears
in one chunk and A ⇒ B in another, the task of discovering B becomes more taxing than if
both the above facts had been found together.

Hunt and various co-workers [Hunt et al. 1966, Diehr and Hunt 1968] have also addressed
this problem, though from a somewhat different perspective. Suppose that memory is limited
to slots for some fixed number of instances. One method they proposed was to scan the
data base in cycles, each instance being inspected once every cycle. The instance was read
into a randomly chosen slot (thus destroying the record of any instance previously occupying
that slot) and, if the current rule was not adequate to explain this instance, a new rule was
immediately developed from only those instances currently occupying the slots. Thus many
rules could be generated in the course of a single cycle, and many cycles could be necessary
to yield a sufficiently correct rule (since this work did not require that the final rule be exact).
A more sophisticated development saved memory space by storing incomplete instances. If
the current rule explained an instance correctly, then only those of its properties used by the
current rule were stored for that instance; if the current rule was incorrect then the entire
instance was saved. Thus the available memory could at any time contain a mixture of partially
and completely specified instances. This approach was able to produce quite accurate rules
using very little memory (e.g. 95% accuracy using memory sufficient for about 5% of the data
base). However the sample problems studied were extremely simple: the data base typically
contained 256 instances, each described by four attributes, and a rule for determining the
class to which an instance belonged needed to test only one or two of the attributes. This
technique does not seem to have been tested on a problem of more realistic complexity.

This paper proposes another approach that does not require changes to the inductive
algorithm being used, but instead attempts to separate out a distinguished subset of D. The
whole of D is presumed to be able to be accessed sequentially, such as by being held on
some secondary storage medium. Note that, using buffering methodologies, it is still possible to
examine the elements of D with an average access time very close to that of primary memory,
but only if we examine a large number of them and in a fixed sequence. The special subset, on
the other hand, is distinguished by our ability to reference individual instances in it randomly
and quickly; it is thus envisaged that this much of D will fit into primary memory. The focus of
this paper is the development of iterative techniques for selecting from D the subset that can
be readily manipulated. At any time the inductive algorithm will be able to 'see' the total data
base only through this subset, which is referred to as a window.

The idea underlying the scheme is as follows. Let W ⊆ D be a window into D. Whatever
induction system we are using should be able to produce a rule R that explains W; if not then
it is hardly likely to be able to find one to explain D. Unless this rule R is a case-by-case
tabulation of W (which most people would not regard as an explanation of W), it will contain
structures that also explain instances not in W, and in particular some instances in D - W. In
general R will not be sufficient to explain all of D; R applied to D - W will partition it into a
set of instances corroborating R and a set of exceptions for which R offers no explanation
or for which its explanation is incorrect. Now if R is viewed as an erroneous explanation
mechanism that must be debugged, then these exceptions (i.e. the bugs) would seem to
be more important than the corroborating instances. They may be discovered, of course, by
examining all instances of D one at a time in some arbitrary order; as noted above this can be
performed out of secondary memory without a significant increase in the average access time.
The suggestion then is to start by selecting W at random from D, and to form a new window
from the old window and the exceptions to the rule formed to explain that window. The old rule
is then discarded, a new rule is found to explain the new window, and the process repeats
until a rule is discovered that has no exceptions and so explains all of D. Note that only one
rule is developed on each iteration, and that the final rule is exact (although the process could
of course be halted as soon as the number of exceptions fell below some threshold).

The most obvious way of forming a new window from the previous one and a set of
exceptions would be simply to merge them. However these exceptions may be redundant in the
same sense that the original data base D is redundant. The window would then grow rapidly,
leading to the same sort of storage troubles that the whole approach is supposed to overcome.
Instead the first method places a limit (called the exceptions limit) on the number of exceptions
that can be added to the window at each iteration. Ideally this limit should be neither too high
(in which case the window grows fat) nor too low (when an inordinate number of iterations
would be required because the window changes so slowly). This question is explored later.
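
The first method can be sketched in a few lines. This is only an illustration under stated
assumptions, not the programs used in the experiments: the names `induce_with_window`,
`induce`, `explains` and `max_window_size` are hypothetical, `induce` is assumed to return a
rule that explains every instance in the window it is given, and taking the first exceptions
met in the sequential scan is an arbitrary choice that the text does not prescribe.

```python
import random


def induce_with_window(data, induce, explains, window_size,
                       exceptions_limit, max_window_size, seed=0):
    """Grow a window until the rule induced from it explains all of `data`.

    induce(window) -> rule           (assumed to explain every window instance)
    explains(rule, instance) -> bool
    Returns (rule, window, exact); exact is False if storage ran out first.
    """
    rng = random.Random(seed)
    window = rng.sample(data, min(window_size, len(data)))  # random initial window
    while True:
        rule = induce(window)
        # One sequential pass over the whole data base collects the exceptions.
        exceptions = [x for x in data if not explains(rule, x)]
        if not exceptions:
            return rule, window, True
        room = max_window_size - len(window)
        if room <= 0:
            return rule, window, False  # window has reached the storage bound
        # Add at most `exceptions_limit` exceptions (assumption: first ones found).
        window = window + exceptions[:min(exceptions_limit, room)]
```

Because the window gains at least one instance per iteration (for a limit of one or more)
and is bounded by `max_window_size`, this loop always terminates, though not necessarily
with an exact rule.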

The first method above still suffers from the disadvantage that the size of the window
increases at each iteration. It could happen that at some stage the window occupies all
available storage and the process must then terminate as no additional exceptions can be
incorporated. The second method forms a new window by blending instances from the old
window with exceptions in such a way that the window size remains constant. An attempt is
made to ensure that at least one instance corresponding to each subcomponent of the old
rule is included in the new window so that sections of the old rule that may be correct are
not forgotten. Here also it seems sensible to restrict the number of exceptions added at each
iteration, not to stop the window growing too rapidly, but rather to prevent the old window
being swamped, and thus essentially discarded, each iteration.
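
One way the blending step might look is sketched below. The text does not spell out the
blending policy, so everything here is an assumption made for illustration: `blend_window`
and `component_of` are hypothetical names, `component_of(instance)` is taken to identify
the subcomponent of the old rule (for a decision tree, say, the leaf) that covers an
instance, instances are assumed to be hashable and distinct, and the remaining places are
filled at random from the old window.

```python
import random


def blend_window(old_window, exceptions, component_of, exceptions_limit, seed=0):
    """Form a new window of the same size as `old_window`.

    Keeps one old instance per subcomponent of the old rule where possible,
    admits at most `exceptions_limit` exceptions, and fills the remaining
    places with randomly chosen instances from the old window.
    """
    rng = random.Random(seed)
    size = len(old_window)

    # One representative instance for each subcomponent of the old rule.
    representatives = {}
    for instance in old_window:
        representatives.setdefault(component_of(instance), instance)
    keep = list(representatives.values())

    n_new = min(exceptions_limit, len(exceptions), size)
    keep = keep[:size - n_new]  # representatives give way only if the window is tiny

    kept = set(keep)
    rest = [x for x in old_window if x not in kept]
    fill = rng.sample(rest, size - n_new - len(keep))
    return keep + exceptions[:n_new] + fill
```

The exceptions limit plays the role described above: it bounds how much of the old window
is displaced in one iteration.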

Two properties of both methods are worth noting. In each case the window changes only
slightly when there are few exceptions to the rule derived from it. As the rule developed
from a window approaches a correct rule for all of the data base D, the next rule is thereby
constrained to be similar. On the other hand when there are many exceptions, the window
and the rule generated from it can change more rapidly. The rate of change of the rule thus
agrees with an intuitive approach to hill-climbing. However the methods are heuristic in nature
and neither can guarantee convergence on a correct rule. The first will stop when the window
size reaches the bounds of the storage available for it, so at least it will always terminate.
The second, though, may search forever attempting to construct a window of fixed size from
which a correct rule for all of D can be induced. If there is no such subset of D (for instance
if the number of subcomponents of any correct rule exceeds the number of instances allowed
in the window) then the search is doomed to failure.


The Problem Attempted

The preceding section outlined an iterative technique for discovering a rule to explain
a data base, and two methods of forming a new window as part of this technique. The problem
chosen as a vehicle for studying them had to have a number of characteristics. First, it had
to be real. It would have been easy to create an artificial problem by choosing a set of
'attributes' and arbitrarily assigning subregions of the space so created to classes. This would
inevitably build into the task preconceptions of what such a task should look like. Secondly,
a substantial data base was necessary because the techniques are primarily concerned with
problems of storage that do not arise with small training sets. The rule sought had to be
non-trivial, because difficulty could be expected to magnify the difference between a sound and
an unsound approach. But in spite of all these the task had to be of manageable proportions.
It was anticipated that the number of trials necessary to evaluate the approach would be in
the hundreds, and although the computer facilities generously made available were substantial
they were not inexhaustible.

The task settled on was related to the chess problem posed by Donald Michie, but scaled
down in complexity. Again there are only four pieces on the board (black king and knight, white
king and rook) with black's turn to move. The question asked is this:

Given: a position (neither stalemate nor checkmate) in which the black knight is
pinned or skewered by the rook, or the rook has achieved a linear fork of the black
knight and king. After black's move, can white immediately capture the knight without
either leaving the rook en prise or causing stalemate?

Two examples are shown in Figure 2. In the left-hand one the black king must move and,
whatever he does, the rook can then capture the knight so the knight is lost. The right-hand
case seems similar, but this time the black king can move to the corner. If he does so the
rook cannot capture the knight without causing stalemate, so the knight is safe. As these
examples show, determining whether the knight is safe or lost two-ply under these conditions
still requires care. Note that the original problem has been restricted in two ways. The initial
position is a pin, (linear) fork or skewer instead of any arbitrary configuration of the pieces;
this reduces the size of the data base. Secondly, we are ignoring the possibility that white
can mate on his next move without capturing the knight; this reduces the complexity of the
rule that must be discovered.


Figure 2: Two examples of initial positions

Fourteen quasi-geometric attributes relevant to this task were defined. Three examples
should give their flavor:


• The distance in king moves from the black king to the white rook, with possible
values 1, 2 and more than 2. If this distance is 1 or 2 there is a possibility that the
black king can move to threaten the rook.

• Whether or not the black knight occupies a square one away from a corner and on
a diagonal. As Figure 2 illustrates, this can be important in deciding whether or not
black can threaten stalemate.

• Whether or not the black king can move so that it is next to the knight, and thus
defending it. This is the least geometric of all the attributes, but it can still be
thought of in terms of set operations rather than search. If we denote the set of
positions

adjacent to the black king by bk
adjacent to the white king by wk
adjacent to the black knight by bn
in the white rook's rank or file by wr
ditto but with the knight between them and the rook by shadow

then this attribute can be expressed as

bk ∩ bn ∩ (~wk) ∩ ((~wr) ∪ shadow) is non-empty
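
The set expression in the third attribute translates almost directly into code. The sketch
below is illustrative only: squares are represented as hypothetical (file, rank) pairs
numbered 0 to 7, the function names are invented, and the rook's own square is assumed to
belong to wr. Like the expression itself, it ignores any blocking of the rook by the kings.

```python
ALL_SQUARES = {(f, r) for f in range(8) for r in range(8)}


def adjacent(square):
    """Squares a king's move away from `square`."""
    f, r = square
    return {(f + df, r + dr)
            for df in (-1, 0, 1) for dr in (-1, 0, 1)
            if (df, dr) != (0, 0) and 0 <= f + df < 8 and 0 <= r + dr < 8}


def rook_lines(rook):
    """Squares in the rook's rank or file (assumption: its own square included)."""
    return {sq for sq in ALL_SQUARES if sq[0] == rook[0] or sq[1] == rook[1]}


def shadow(rook, knight):
    """Squares on the rook's rank or file with the knight between them and the rook."""
    (rf, rr), (nf, nr) = rook, knight
    if nf == rf and nr != rr:  # knight on the rook's file
        step = 1 if nr > rr else -1
        return {(rf, r) for r in range(nr + step, 8 if step > 0 else -1, step)}
    if nr == rr and nf != rf:  # knight on the rook's rank
        step = 1 if nf > rf else -1
        return {(f, rr) for f in range(nf + step, 8 if step > 0 else -1, step)}
    return set()


def king_can_defend_knight(black_king, white_king, black_knight, white_rook):
    """True when bk ∩ bn ∩ (~wk) ∩ ((~wr) ∪ shadow) is non-empty."""
    bk, wk, bn = adjacent(black_king), adjacent(white_king), adjacent(black_knight)
    wr = rook_lines(white_rook)
    safe_from_rook = (ALL_SQUARES - wr) | shadow(white_rook, black_knight)
    return bool(bk & bn & (ALL_SQUARES - wk) & safe_from_rook)
```

For example, with the rook on (0, 0), the knight pinned on (0, 3) and the black king on
(0, 5), the king can step to (0, 4) or (1, 4), so the attribute holds.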

Nine of these attributes had two permissible values; five of them had three.

All possible initial chess positions as above were generated and the values of the fourteen
attributes plus the class (lost or safe) determined. These attributes do not define a position
uniquely: many different positions can map into a single vector of attribute values, so the
number of distinct vectors is a small proportion of the number of possible positions. Nor are
they independent: most of the 2^9 × 3^5 points in the attribute space do not correspond to any
position. In this case it was found that

• No two positions with the same vector of attribute values belonged to different
classes, so the attributes are adequate for this task. Had this not been the case,
there would have been positions for which it would not have been possible to tell
whether the knight was lost or safe from the values of the attributes alone, and so
the task of explaining the data base would have been impossible.

• There are 1,987 distinct vectors.
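
As a quick check on the claim that most points of the attribute space are unoccupied, the
arithmetic can be worked through directly from the figures given above (nine two-valued
and five three-valued attributes, 1,987 distinct vectors). The variable names are
illustrative only.

```python
# Size of the attribute space: nine two-valued and five three-valued attributes.
space_size = 2 ** 9 * 3 ** 5          # 512 * 243 = 124416
distinct_vectors = 1987               # figure reported in the text
fraction_occupied = distinct_vectors / space_size

print(space_size)                     # 124416
print(f"{fraction_occupied:.1%}")     # 1.6%
```

So fewer than two in every hundred points of the attribute space correspond to any
position.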

The data base (hereafter referred to as pf&s) was taken as this entire collection of nearly
two thousand instances. Since it is complete, any rule that explains it gives a precise answer
to the question posed above. The complexity of an exact rule couched in terms of these
attributes is brought out by the Appendix, which gives a full specification of the attributes
and the simplest known correct rule in the form of a decision tree.

It is important to realise that the game of chess has no central role in this evaluation,
but merely serves as the source of a non-trivial task that can readily be understood by human
beings. Neither the induction algorithm nor the iterative technique being investigated has
anything pertaining to chess associated with it. As far as the programs are concerned,
this is a task of inducing a complex class-prediction rule from a data base of two thousand
elements described by fourteen other attributes. Any task of similar dimensions would be
handled similarly, whether it arose in the context of analysing bubble-chamber photographs
or picking the winner of the Melbourne Cup.

The Induction Algorithm Used

The iterative technique proposed earlier is clearly not tied to any one induction algorithm.
Its only interaction with the latter is to provide a subcollection of instances and to receive back
a rule adequate to explain the subcollection. It should be possible to replace the algorithm used
in this set of experiments with another and obtain comparable results. In fact the algorithm
used here is a simple and unsophisticated relative of Hunt's CLS. The virtues of our version
lie in the direction of compactness and efficiency rather than 'intelligence', so if anything a
more advanced algorithm should give better results.

CLS-like systems require that each instance be specified in the same fairly rigid format.
We imagine some fixed list of attributes each having a (usually small) finite number of
permissible values. For example color of eyes might be an attribute with permissible values
blue, green, brown, black and other. A rule will relate one of these attributes to values of the
others. The attribute so distinguished is normally called a class from the fact that this approach
was developed in a pattern-recognition context.

Suppose we have a collection C of instances whose class we wish to relate to the other
attributes. Three cases arise. If C is empty then we can offer no explanation of it. If all
members of C are of the same class we can relate them to this class without further testing.
Otherwise we select some attribute A whose permissible values are A1, A2, ..., An (say) and
partition C into subcollections C1, C2, ..., Cn where Ci contains the members of C that have
value Ai of A. Graphically we have a structure

              attribute

   A1     A2     A3     ...     An

   C1     C2     C3     ...     Cn

where each branch corresponding to a value of A contains a subcollection of instances. We
then do the obvious recursive thing of applying the same process to each Ci in turn. The result
is a tree. Each internal node causes us to branch on the value of some attribute. Each leaf is
either null because no instances in the training set correspond to it, or contains a collection of
instances of a single class which we will call the assigned class of that leaf. This tree is itself
the rule that we seek, because it maps any instance to a leaf which either has an assigned
class or is null.
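
The three cases and the recursion can be sketched as follows. This is a minimal
illustration, not the PASCAL programs used in the experiments: the tuple encoding of
trees, the names `build_tree` and `classify`, and the default of branching on the first
attribute that actually divides the collection are all assumptions made for the example.
Each instance is taken to be a pair (attribute values, class), and `values[a]` lists the
permissible values of attribute number a.

```python
def build_tree(instances, values, choose=None):
    """Return None (null leaf), ("leaf", cls) or ("node", attribute, branches)."""
    if not instances:
        return None                                  # empty collection: null leaf
    classes = {cls for _, cls in instances}
    if len(classes) == 1:
        return ("leaf", classes.pop())               # single class: assigned class
    # Only attributes that divide the collection are worth testing.
    usable = [a for a in range(len(values))
              if len({x[a] for x, _ in instances}) > 1]
    if not usable:
        raise ValueError("the attributes cannot separate the classes")
    # Assumption: without a chooser, branch on the first usable attribute.
    a = choose(instances, usable) if choose else usable[0]
    branches = {v: build_tree([(x, c) for x, c in instances if x[a] == v],
                              values, choose)
                for v in values[a]}
    return ("node", a, branches)


def classify(tree, x):
    """Follow the tree to a leaf; None means the leaf is null."""
    while tree is not None and tree[0] == "node":
        tree = tree[2][x[tree[1]]]
    return None if tree is None else tree[1]
```

A tree built this way classifies every instance it was built from correctly, and returns
None for a vector that falls on a null leaf.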

The trick in building this tree is in deciding at each stage which attribute A to branch on.
Certain attributes can immediately be excluded from consideration. If all but one of the Ci's
are empty then it would be pointless to select this attribute; this filter automatically prevents
testing the same attribute twice. There is still considerable scope for choice, which can be made
in many ways. As originally suggested by Hunt, we could use a system of costs: measurement
costs for determining the value of an attribute, and misclassification costs estimated by a
type of lookahead procedure. Using this approach we would select that attribute for which
the sum of these costs was minimal, and so hope to build a minimum-cost rule. Alternatively
we could use task-specific information to reject or suggest possible attributes in context.
For example we could reject as unlikely the testing of whether the knight could flee before
seeing whether it could capture the rook, or we might suggest that it is a good idea to start by
finding whether the king is in check. Such a knowledge-based approach would use plausibility
to guide the rule construction in much the same way that Meta-DENDRAL uses a weak model
of its task domain to guide the construction of a strong one, and would tend to produce a rule
that a human being might find easy to understand.

For these experiments, though, we used a much more straightforward task-independent
heuristic aimed at giving a simple rule. (The choice of simplicity as the goal was made because
such a rule is more likely to be general, and so cover instances not included in the window from
which it was formed.) Suppose as above that we are considering attribute X as the next test.
If we select it, we will be left with the residual task of building trees for each of the $C_i$, and we
would like some predictor for the complexity of these subtrees. Each $C_i$ is a collection of some
number $s_i$ of instances of class safe and $l_i$ of class lost. If either of these numbers is zero
then the tree for the branch containing $C_i$ will be a leaf. If the number in either class is small
then we would expect that it would not take many more tests to separate these instances
from the rest. After some trial and error we found that an estimate of the complexity of the
tree required for $C_i$ as

$$\sqrt{\min(s_i, l_i)}$$

appears to work quite well. The total complexity of all residual subtrees if X is selected is
then the sum of these measures over the collections $C_i$. At any stage, the attribute selected
as the next node on the tree is one that minimizes this estimated total residual complexity,
and so the selection method is akin to a 1-step lookahead. (Some additional experiments are
currently being performed using an information-theoretic model of complexity.)
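
The selection step can be sketched as follows, assuming the complexity estimate is the square root of the smaller class count as reconstructed above. The function names, the representation of an instance as an (attribute dictionary, class) pair and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter


def residual_complexity(instances, attribute):
    """Sum of sqrt(min(s_i, l_i)) over the collections C_i induced by `attribute`."""
    counts = {}
    for attrs, cls in instances:
        counts.setdefault(attrs[attribute], Counter())[cls] += 1
    # A Counter returns 0 for a class that never occurs in a collection.
    return sum(math.sqrt(min(c["safe"], c["lost"])) for c in counts.values())


def choose_attribute(instances, attributes):
    """Pick the attribute with the smallest estimated residual complexity."""
    # Skip attributes for which all but one C_i would be empty.
    usable = [a for a in attributes
              if len({attrs[a] for attrs, _ in instances}) > 1]
    return min(usable, key=lambda a: residual_complexity(instances, a))


# Hypothetical toy data: attribute "a" separates the classes, "b" does not.
toy = [
    ({"a": "t", "b": "t"}, "safe"),
    ({"a": "t", "b": "f"}, "safe"),
    ({"a": "f", "b": "t"}, "lost"),
    ({"a": "f", "b": "f"}, "lost"),
]
print(residual_complexity(toy, "a"))  # 0.0
print(residual_complexity(toy, "b"))  # 2.0
print(choose_attribute(toy, ["a", "b"]))  # a
```

In this sketch `choose_attribute` raises `ValueError` when no attribute is usable; a full tree builder would turn such a collection into a leaf before reaching this point.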

This induction algorithm and the iterative technique that uses it exist as a collection of
programs written in a dialect of PASCAL for the DEC-10 [Kisicki and Nagel 1976]. All execution
times appearing in this report are for a KL-10. The programs themselves were written with an
eye to their adaptation for other problems; copies are available from the author.

Results  Using  the  First  Method 

Recall that the first technique involved selecting a subset of the data base called the window, and
repeatedly adding to it exceptions to the rule developed to explain it, with the restriction
that no more than some fixed number of exceptions can be added at each iteration. The two
important variables here are the size of the initial window and the exceptions limit. If these
numbers are very small then many iterations will be required to generate a correct rule; if too
large, then not only will the window grow rapidly, but the time taken by the inductive algorithm
to find a rule adequate to explain the window at each iteration will also increase. Two examples
with the pf&s data base should illustrate this. A run was made with the initial window containing
40 instances and the number of exceptions added at each iteration limited to 20 (i.e. 2% and
1% of the data base respectively). Ten iterations were required to generate a rule correct
for the whole data base. Each was fairly short since the window was small; the total time required was 6.4
seconds and the final window size was 196 instances. In contrast, when the initial window
contained 400 instances and up to 200 exceptions could be added at each iteration (20%
and 10% respectively), only four iterations were needed. The total time however was still 4.1
seconds due to the larger initial window size. As a consequence of this larger initial window
there were fewer exceptions at each iteration and the final window contained 450 instances.
More details of these runs appear in Table 1.
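
The iteration just described can be sketched as a short loop. This is a hedged illustration rather than the author's program: `first_method`, `induce_threshold` and the toy data are hypothetical, and the one-cut-point inducer merely stands in for the tree-building algorithm so that the example runs on its own.

```python
import random


def first_method(data, induce, initial_size, exceptions_limit, seed=0):
    """data: list of (instance, cls) pairs; induce(window) returns a callable rule."""
    rng = random.Random(seed)
    window = set(rng.sample(range(len(data)), initial_size))
    iterations = 0
    while True:
        iterations += 1
        rule = induce([data[i] for i in sorted(window)])
        exceptions = [i for i, (x, cls) in enumerate(data) if rule(x) != cls]
        if not exceptions:
            return rule, iterations, len(window)
        new = [i for i in exceptions if i not in window]
        if not new:
            raise ValueError("induced rule does not explain its own window")
        # Add at most `exceptions_limit` exceptions, chosen at random.
        if len(new) > exceptions_limit:
            new = rng.sample(new, exceptions_limit)
        window.update(new)


def induce_threshold(window):
    """Toy stand-in for the tree builder: learns a single numeric cut point."""
    safe = [x for x, cls in window if cls == "safe"]
    lost = [x for x, cls in window if cls == "lost"]
    cut = (max(safe, default=float("-inf")) + min(lost, default=float("inf"))) / 2
    return lambda x: "lost" if x > cut else "safe"


# Hypothetical data base of 100 instances.
data = [(x, "lost" if x >= 60 else "safe") for x in range(100)]
rule, iterations, final_size = first_method(data, induce_threshold, 4, 2)
print(iterations, final_size)
```

Because every exception lies outside the window whenever the inducer explains its own window, the window grows strictly at each iteration and the loop must terminate.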

A better choice for the two parameters would seem to lie somewhere between these
extremes. A number of runs were made with the initial window size ranging from 50 to 350 and
the exceptions limit varying from 25 to 125. Each run was repeated 10 times with different
random choices of both the initial window and the exceptions added at each iteration (when
the number of the latter exceeded the limit). As before, three measurements are important:

• The total time taken to find a correct rule; this gives a direct comparison of the
suitability of different combinations of the initial window size and the exceptions
limit.

• The number of iterations required; this gives a feel for the rate of convergence of
the process.

• The final window size; this demonstrates the reduction in random access storage
that may be achieved using the scheme.

The averages of these three measures over each group of ten runs are shown in Tables 2a, 2b
and 2c. They indicate that, for intermediate values of the initial window size and exceptions
limit, the first method exhibits quite consistent behavior. A correct rule is typically discovered
in about 4 iterations and 3-3.6 seconds, with the final window containing 14%-17% of the
total data base. This consistency is quite important to the usefulness of the approach. If it had
turned out that performance was very sensitive to the choice of values for the initial window
size and/or the exceptions limit, then using this method in practice would entail many trials to
find a happy combination of values. The substantial 'floor' covering most intermediate values
of the tables is reassuring in that, if it carries over to other problems, a fairly arbitrary choice
of these parameters is likely to yield acceptable results.

Given the complexity of the rule in the Appendix, taking a little over three seconds to
find it seems good, and doing so in about four iterations attests to the rapid convergence of
the scheme. But the primary justification for the approach is that it requires less fast memory,
in this case about one sixth of what would be needed to store the whole data base. To place
this figure in context (at least with respect to the induction algorithm being used), we tried
to produce a correct rule directly from large subsets of the instances. Although more than
100 such subsets containing from 26% to 60% of the total data base were tried, in no case
was a rule found that was adequate for all the data. This contrasts with the performance of
the iterative approach, which is able to find a correct rule each time using considerably less
space.

Another result using this method may be of interest. If the initial window size and
exceptions limit are set very low, a correct rule is (eventually) obtained with a very compact
final window. For example, one run with both of these parameters set to 1 found a correct
rule with a final window consisting of only 88 instances, or about 4% of the total data base.
This final window represents in some sense a distillation of all the pf&s data, containing as it
does all the 'special' cases and enough of the ordinary ones to indicate a correct rule. So this
technique may also be an appropriate mechanism for sifting large data bases so as to build a
tutorial collection of important cases.

Results With the Second Method

The second method differs from the first in the way that the new window is formed at
each iteration. The rules governing its composition are:

• The window size remains fixed.

• At least one instance corresponding to each subcomponent of the rule developed
from the old window also appears in the new one. In our case the rule is a decision
tree, each subcomponent of which is identified with a leaf. As before, some leaves
are null in the sense that no instances map to them. So each non-null leaf in the old
rule is represented in the new window by one or more of the instances that mapped
to that leaf.

• No more than a fixed number of exceptions can become part of the new window.
Some preliminary experiments were carried out varying this limit from 33⅓% to 50%
of the (fixed) window size; results were quite uniform, so the higher percentage
was chosen.

The number of exceptions incorporated into the new window was then the minimum of

    the number of exceptions to the old rule,
    half the fixed window size, and
    the window size less the number of non-null leaves in the old rule

and, as before, a random subset of this size was chosen from the exceptions if there were
too many to include them all. The instances from the old rule were also selected randomly;
first, one corresponding to each non-null leaf in the old rule, and then an appropriately sized
subset of the remainder.
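
The composition rules above can be sketched as a single function. Several details here are assumptions rather than statements from the text: the name `next_window`, the grouping of old-window instances by leaf, the use of integer division for "half the fixed window size", and the reading of "the remainder" as the old-window instances not already chosen as leaf representatives.

```python
import random


def next_window(old_by_leaf, exceptions, window_size, rng):
    """old_by_leaf maps each non-null leaf to the old-window instances that reached it."""
    n_leaves = len(old_by_leaf)
    if window_size < n_leaves:
        raise ValueError("window too small to represent every non-null leaf")
    # Minimum of the three quantities listed in the text.
    n_exceptions = min(len(exceptions), window_size // 2, window_size - n_leaves)
    chosen = rng.sample(exceptions, n_exceptions)
    # One randomly chosen representative per non-null leaf.
    picks = {leaf: rng.randrange(len(members)) for leaf, members in old_by_leaf.items()}
    representatives = [old_by_leaf[leaf][k] for leaf, k in picks.items()]
    # Fill any remaining slots at random from the rest of the old window.
    remainder = [x for leaf, members in old_by_leaf.items()
                 for k, x in enumerate(members) if k != picks[leaf]]
    n_fill = min(window_size - n_exceptions - n_leaves, len(remainder))
    return representatives + chosen + rng.sample(remainder, n_fill)


# Hypothetical example: three non-null leaves, five exceptions, window of six.
old_by_leaf = {"A": [1, 2, 3], "B": [4, 5], "C": [6]}
exceptions = [10, 11, 12, 13, 14]
new_window = next_window(old_by_leaf, exceptions, 6, random.Random(1))
print(len(new_window))  # 6
```

With a window of six, the limits are 5 exceptions available, 3 as half the window, and 3 as the window size less the three leaves, so three exceptions are admitted and the three leaf representatives fill the rest.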

The reason for developing this method was to ensure that the window would not expand
to such an extent that it overran the storage available for it. Consider for example the results
with the first method where the initial window contained 150 instances and the exceptions
limit was 75. The average final window size over the ten runs is shown in Table 2c as 297, but
in five of those ten cases the final window contained more than 300 instances (the maximum
was 343). If the total space available for storing the window had been sufficient for only
300 instances, the first method using these parameter values would have been successful in
developing a correct rule in only half the trials; the others would have had to be abandoned
because of insufficient space. In contrast, if the second method were used with a window
size of 300 instances then it would always succeed, because there do exist collections of
300 instances from which it is possible to discover a correct rule.

This second method may be viewed then as containing an element of insurance, and as
such should be expected to incur some cost in terms of increased times to find a correct rule.
As before, a number of different values of the only parameter (window size) were tried with ten
runs each; the results are summarized in Table 3. This shows that any additional cost is slight,
as a typical run here requires about 3.5 seconds and 4 iterations. Once again the performance
of the iterative scheme is quite stable for intermediate values (i.e. window sizes around 16%
of the total data base), but falls off rather sharply as the window size is reduced below a
comfortable level. The reason for these increased times would seem to be that the number
of subsets of a given size from which a correct rule can be induced declines rapidly with this
size, and so the number of iterations necessary to find one of them increases. On the other
end of the scale, however, the performance is predictable: the slow increase in time taken
represents the roughly linear additional effort required to generate a rule for a subset as the
number of instances in it grows.

Conclusion 

The pf&s data base has been explored quite extensively using these techniques. Although
it is the largest data base for which a comprehensive investigation has been conducted, further
encouraging results have been obtained when the iterative approach has been applied to other
problems. In a recent trial, for example, another set of 29,236 instances each described by
26 attributes was analysed using the first method. A rule was found at the first attempt, using
an initial window size of 400 and an exceptions limit of 100. Twenty iterations were needed,
taking a total of 394 seconds, and the final window contained 2,160 instances (or about 7% of
the data base). The rule itself was roughly seven times as complex as the one for pins, forks
and skewers shown in the Appendix. If the difficulty of an induction task can be approximated
by the product

rule complexity × number of instances × number of attributes

then this problem was more than two orders of magnitude harder than pf&s. Comparison of
the time required to solve each problem suggests that it increases only linearly with the
difficulty of the induction task, and experiments with tasks of intermediate difficulty support this
conjecture. This approach does not seem to suffer from a combinatorial explosion (such as
that noted in [Hayes-Roth and McDermott 1977]) which would prevent its application to more
difficult problems.
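
A rough check of this arithmetic is sketched below. Two inputs are assumptions rather than figures stated in this section: pf&s is taken to have about 2,000 instances, and its attribute count is taken to be the 14 names listed in the Appendix. The 3.3 seconds used for pf&s is the mean time quoted for intermediate parameter values, so the comparison is only indicative.

```python
def difficulty(rule_complexity, n_instances, n_attributes):
    """Product used in the text to approximate the difficulty of an induction task."""
    return rule_complexity * n_instances * n_attributes


# pf&s taken as the unit of rule complexity; 2000 and 14 are assumed values.
pfs = difficulty(1.0, 2000, 14)
# The larger problem: rule about seven times as complex, figures as given in the text.
larger = difficulty(7.0, 29236, 26)

difficulty_ratio = larger / pfs
time_ratio = 394 / 3.3  # seconds for the larger problem over seconds for pf&s

print(round(difficulty_ratio), round(time_ratio))  # 190 119
```

Under these assumptions the difficulty grows by a factor of roughly 190 while the time grows by a factor of roughly 120, which is in line with the claim that time increases no faster than linearly with difficulty.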

These results, together with those of the previous sections, demonstrate that it is indeed
possible to generate a correct rule to explain a large collection of instances, even when only
a small part of this data base can be held in random access memory. The iterative technique
discussed, coupled with either method of selecting the randomly accessible subset of the
data, seems to converge quickly. One main advantage of this iterative technique is that it
requires little or no modification of whatever inductive algorithm is being used; it changes only
the way in which that algorithm is used.

In fact the results go further than this. The pf&s induction problem used as an example
throughout this paper contains about two thousand instances. This was small enough to fit
in random access memory, so the same inductive algorithm used in the above experiments
was applied to it directly. The rule (a correct one, naturally, since it was derived from and
explained the entire data base) was obtained in 3.5 seconds. This figure is marginally greater
than the time required in many of the trials using the iterative technique with both the first
and second window-selecting methods, and almost all of those using intermediate values for
the parameters (for which the mean time is 3.3 seconds with standard deviation 0.2). Thus
the techniques discussed in this paper, which may become a matter of necessity when dealing
with a large data base, may also be desirable for reasons of efficiency when dealing with a
smaller one.

              Initial Window Size 40       Initial Window Size 400
              Exceptions Limit 20          Exceptions Limit 200

Iteration     Window     Number of         Window     Number of
              Size       Exceptions        Size       Exceptions

    1           40          176              400          27
    2           60          266              427          18
    3           80          117              445           5
    4          100          346              450           -
    5          120          131
    6          140           43
    7          160           64
    8          180           15
    9          195            1
   10          196            -

              time: 6.4 seconds            time: 4.1 seconds

Table 1: Detailed results for two extreme cases.

Initial                 Exceptions Limit
Window
Size          25      50      75     100     125

  50         4.9     4.2     4.2     4.2     4.3
 100         4.4     3.7     3.1     3.9     3.6
 150         3.9     3.0     3.1     3.4     3.8
 200         3.4     3.7     3.0     3.2     3.2
 250         3.6     3.4     3.4     3.2     3.3
 300         4.0     3.2     3.4     3.6     3.3
 350         3.9     3.4     3.4     3.6     3.7

Table 2a: Average time to find correct rule (seconds).


Initial                 Exceptions Limit
Window
Size          25      50      75     100     125

  50         8.6     6.3     6.1     6.6     6.6
 100         7.1     6.6     4.6     6.1     4.6
 150         6.8     4.3     4.3     4.6     4.6
 200         4.8     4.8     3.9     4.0     4.0
 250         4.6     4.1     4.1     3.9     4.0
 300         4.8     3.6     3.8     3.9     3.8
 350         4.1     3.6     3.6     3.9     3.0

Table 2b: Average iterations to find correct rule.


Initial                 Exceptions Limit
Window
Size          25      50      75     100     125

  50         207     248     317     376     300
 100         226     268     204     360     368
 150         242     260     297     321     343
 200         260     310     312     332     326
 250         316     362     344     366     361
 300         362     381     387     387     378
 350         406     417     421     412     416

Table 2c: Average final window size.

Window Size     Average Time     Average Iterations
                to Find Rule     to Find Rule

    200             10.1               14.0
    250              4.6                6.6
    275              3.4                4.1
    300              3.4                4.1
    325              3.3                3.8
    350              3.4                3.6
    375              3.7                3.9
    400              3.6                3.6
    500              3.9                3.4
    600              4.6                3.4

Table 3: Average performance over ten runs using the second method.


Appendix: A correct rule for the pins, forks and skewers induction problem.

This rule is presented as a tree turned on its side. Each interior node is an attribute name,
and the branches corresponding to the various values of the attribute appear underneath it.
Each leaf is either a class (lost or safe) or null.


The attribute names and their meanings are as follows:

distance k-n         distance in king moves from the black king to the
                     black knight ('3' means 'greater than 2')
distance k-r         ditto black king to white rook
distance k-wk        ditto black king to white king
distance n-r         ditto black knight to white rook
distance n-wk        ditto black knight to white king
distance r-wk        ditto white rook to white king
r threatens k        white rook checks black king
r threatens n        white rook threatens black knight
k can approach       black king can move adjacent to black knight
k at p2              black king is on an edge one square from a corner
wk at p3             white king is on an edge two squares from a corner
wk at p6             white king diagonally two squares from a corner
n at p4              black knight is diagonally one square from a corner
wk next to r line    white king is in a rank or file adjacent to the one
                     occupied by the white rook

Rule for pins, linear forks and skewers

[Decision tree diagram. The attributes tested in it include k can approach, distance n-wk,
distance k-r, k at p2, n at p4, distance r-wk, wk at p3, distance n-r, distance k-wk and
wk next to r line; each leaf is lost, safe or null.]